Multiple linear regression: exploring interaction (Notes)

STAT 155

Author

Your Name

Notes

Learning goals

By the end of this lesson, you should be able to:

  • Describe when it would be useful to include an interaction term to a model
  • Write a model formula for an interaction model
  • Interpret the coefficients in an interaction model in the data context

Readings and videos

Today is a day to discover ideas, so no readings or videos to go through before class.

File organization: Save this file in the “Activities” subfolder of your “STAT155” folder.

Class exploration

Guiding question: What job sectors have the highest return on education?

We’ll use data from the 2018 Current Population Survey to explore. The codebook for this data is available here. For now we’ll focus on individuals who have jobs in the management or transportation sectors to simplify our explorations.

# Load packages and data
library(readr)
library(ggplot2)
library(dplyr)
cps <- read_csv("https://mac-stat.github.io/data/cps_2018.csv")

# Get data on just the management and transportation sectors
cps_sub <- cps %>% 
    filter(industry %in% c("management", "transportation"))

It would be great to know the true effect of years of education on wages. Let’s start by looking at the relationship between these two variables.

ggplot(cps_sub, aes(x = education, y = wage)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE)

There is a positive correlation between years of education and wages, with a fair bit of spread about the line of best fit. We can fit a simple linear regression model to obtain the intercept and slope of that line.

wage_mod_1 <- lm(wage ~ education, data = cps_sub)
coef(summary(wage_mod_1))
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) -59797.480  6375.9627 -9.378581 1.010664e-20
## education     8274.492   410.9261 20.136206 1.379791e-86

Note that our intercept estimate is negative which doesn’t make sense! People with 0 years of education still earn wages! Inspecting the plot, we likely could have better accounted for this with a nonlinear transformation of the education variable, but we will leave this issue aside for now.

Let’s focus on this question: does this simple linear regression model help us understand the true effect of years of education?

We’ll want to consider confounding variables in order to better answer that question. One possible confounder is industry. Draw a causal diagram showing how industry, years of education, and wage relate, and explain what unfair comparisons result from using a simple linear regression model.

Let’s adjust for industry by fitting a multiple linear regression model.

wage_mod_2 <- lm(wage ~ education + industry, cps_sub)
coef(summary(wage_mod_2))
##                          Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)            -54505.666  7274.6504 -7.492548 8.015884e-14
## education                7973.563   456.6417 17.461311 3.244777e-66
## industrytransportation  -5420.432  3589.2564 -1.510182 1.310637e-01

Interpret the education and industrytransportation coefficients in the context of the data. (Remember to include units.) How does the relationship between years of education and wages change after adjusting for industry?

Hold on! We sped ahead too quickly. It’s important to visualize our data thoroughly first. Let’s add industry to our original scatterplot. What do you notice about the lines of best fit for these two industries?

ggplot(cps_sub, aes(x = education, y = wage, color = industry)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE)

It would be nice to be able to capture the different trend for the two industries. Let’s see if our original multiple linear regression model (wage_mod_1) was able to do so.

# Visualize the relationships from the fitted model
ggplot(cps_sub, aes(y = wage, x = education, color = industry)) + 
    geom_line(aes(y = wage_mod_2$fitted.values))

What do you notice about what our model produces? Based on your coefficient interpretations from earlier, is this behavior what you would have expected? How do you think our multiple linear regression model is limited? How might we try to fix this?

In our causal diagram, both years of education and industry affect wages, and one way to capture this is with our model in wage_mod_2:

\(E[\text{wage} \mid \text{education}, \text{industry}) = \beta_0 + \beta_1 \text{education} + \beta_2 \text{industrytransportation}\)

Some other ways to capture how wages are affected by years of education and industry could look like this:

  • \(\beta_0 + \beta_1 \text{education} + \beta_2 \text{industrytransportation} + \beta_3 \text{education}^2\)
  • \(\beta_0 + \beta_1 \text{education} + \beta_2 \text{industrytransportation} + \beta_3 \log(\text{education})\)
  • \(\beta_0 + \beta_1 \text{education} + \beta_2 \text{industrytransportation} + \beta_3 \text{education}*\text{industrytransportation}\)

That last type of model is called an interaction model. A general interaction model formula looks like this:

\(E[Y \mid X_1, X_2) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1*X_2\)

The outcome \(Y\) depends on \(X_1\) and \(X_2\) with the usual multiple linear regression part: \(\beta_1 X_1 + \beta_2 X_2\). But it also includes an interaction term \(\beta_3 X_1*X_2\).

Let’s fit an interaction model for our cps_sub data and explore what relationships our model estimates.

# Fit an interaction model
# NEW SYNTAX: Note the * instead of +
wage_mod_2 <- lm(wage ~ education * industry, cps_sub)

# Visualize the relationships from the interaction model
ggplot(cps_sub, aes(y = wage, x = education, color = industry)) + 
    geom_line(aes(y = wage_mod_2$fitted.values))

How does our new interaction model compare to our previous one?

This is a more complex new model! Let’s explore what is going on mathematically by examining the overall model formula and how we can use it to get model formulas for each industry.

# View coefficient estimates
coef(summary(wage_mod_2))
##                                    Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)                      -65590.606   7611.498 -8.617306 9.218848e-18
## education                          8678.274    478.344 18.142326 3.646579e-71
## industrytransportation            90232.230  20117.859  4.485181 7.457878e-06
## education:industrytransportation  -7580.228   1568.831 -4.831768 1.396075e-06

Model formulas:

E[wage | education, industry] = -65590.606 + 8678.274 education + 90232.230 transportation - 7580.228 education * transportation

Broken down by industry:

Management:

E[wage | education, industry = management] = -65590.606 + 8678.274 education

Transportation:

E[wage | education, industry = transportation] = -65590.606 + 8678.274 education + 90232.230 - 7580.228 education = (-65590.606 + 90232.230) + (8678.274 - 7580.228)education
= 24641.62 + 1098.046education

Question 1: The intercept coefficient, -65590.606, corresponds to what property of the lines?

  1. management intercept
  2. transportation intercept
  3. how the transportation intercept compares to the management intercept

Thus, how can we interpret this coefficient in the context of the wage analysis?

Question 2: The transportation coefficient, 90232.230, corresponds to what property of the lines?

  1. management intercept
  2. transportation intercept
  3. how the transportation intercept compares to the management intercept

Thus, how can we interpret this coefficient in the context of the wage analysis?

Question 3: The education coefficient, 8678.274, corresponds to what property of the lines?

  1. management slope
  2. transportation slope
  3. how the transportation slope compares to the management slope

Thus, how can we interpret this coefficient in the context of the wage analysis?

Question 4: The interaction coefficient, -7580.228, corresponds to what property of the lines?

  1. management slope
  2. transportation slope
  3. how the transportation slope compares to the management slope

Thus, how can we interpret this coefficient in the context of the wage analysis?

Exercises

Exercise 1: Wages across all industries

The plot below illustrates the relationship between wage and education for all of the industries in our cps dataset.

# Plot
ggplot(cps, aes(y = wage, x = education, color = industry)) + 
    geom_smooth(method = "lm", se = FALSE)

  1. What about this plot indicates that it would be a good idea to fit an interaction model?

  2. What industry will R use as the reference category?

  3. (Challenge!) Before fitting the model in R, write down what you think the model formula will look like.

  4. Fit a model that includes an interaction term between education and industry.

# Fit an interaction model called wage_model


# Display summarized model output
  1. In what industry do wages increase the most per additional year of education? What is this increase?

  2. Similarly, in what industry do wages increase the least per additional year of education? What is this increase?

Exercise 2: Thinking beyond

Do you think there are other variables (which may or may not be in our cps data) that have an interaction with industry in affecting wages? If you were to fit an interaction model, what results might you expect to find?

Reflection

Through the exercises above, we developed ideas about when to fit interaction models and how to interpret results. Describe what makes sense and what is still unclear about this topic.

Response: Put your response here.

Done!

  • Finalize your notes: (1) Render your notes to an HTML file; (2) Inspect this HTML in your Viewer – check that your work translated correctly; and (3) Outside RStudio, navigate to your ‘Activities’ subfolder within your ‘STAT155’ folder and locate the HTML file – you can open it again in your browser.
  • Clean up your RStudio session: End the rendering process by clicking the ‘Stop’ button in the ‘Background Jobs’ pane.
  • Check the solutions in the course website, at the bottom of the corresponding chapter.
  • Work on homework!